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Abstract 

The numerical solution of partial differential equations on high-dimensional domains gives 
rise to computationally challenging linear systems. When using standard discretization tech¬ 
niques, the size of the linear system grows exponentially with the number of dimensions, making 
the use of classic iterative solvers infeasible. During the last few years, low-rank tensor ap¬ 
proaches have been developed that allow to mitigate this curse of dimensionality by exploiting 
the underlying structure of the linear operator. In this work, we focus on tensors represented 
in the Tucker and tensor train formats. We propose two preconditioned gradient methods 
on the corresponding low-rank tensor manifolds: A Riemannian version of the preconditioned 
Richardson method as well as an approximate Newton scheme based on the Riemannian Hes¬ 
sian. For the latter, considerable attention is given to the efficient solution of the resulting 
Newton equation. In numerical experiments, we compare the efficiency of our Riemannian 
algorithms with other established tensor-based approaches such as a truncated preconditioned 
Richardson method and the alternating linear scheme. The results show that our approximate 
Riemannian Newton scheme is significantly faster in cases when the application of the linear 
operator is expensive. 

Keywords: Tensors, Tensor Train, Matrix Product States, Riemannian Optimization, Low 
Rank, High Dimensionality 

Mathematics Subject Classifications (2000): 65F10, 15A69, 65K05, 58C05 


1 Introduction 

This work is concerned with the approximate solution of large-scale linear systems Ax = f with 
A G K”^". In certain applications, such as the structured discretization of d-dimensional partial 
differential equations (PDEs), the size of the linear system naturally decomposes as n = nin 2 • • - nd 
with € N for /X = 1,..., d. This allows us to view Ax = / as a tensor equation 

yix = F, (1) 

where F,X e K"ixn 2 x -xnd of order d and Al is a linear operator on K”ixn 2 x -xnd_ 

The tensor equations considered in this paper admit a decomposition of the form 

A = C + V, (2) 

where £ is a Laplace-like operator with the matrix representation 

L = Im® ■ ■ ■ ® In2® Li+ ® L 2 ® Ini^ - VLd® In^-i ® ' ' ' ® h, (3) 
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with matrices G and identity matrices The term V is dominated by C in the sense 

that C is assumed to be a good preconditioner for A. Equations of this form arise, for example, 
from the discretization of the Schrodinger Hamiltonian im, for which C and V correspond to 
the discretization of the kinetic and the potential energy terms, respectively. In this application, A 
(and thus also L^) is symmetric positive definite. In the following, we restrict ourselves to this case, 
although some of the developments can, in principle, be generalized to indefinite and nonsymmetric 
matrices. 

Assuming A to be symmetric positive definite allows us to reformulate Q as an optimization 
problem 


min 

XgR"lX-.Xr.rf 


i(X,AX)-(X,F) 


(4) 


It is well-known that the above problem is equivalent to minimizing the A-induced norm of the error 
||X-A-iF|U. Neither Q nor Q are computationally tractable for larger values of d. During the 
last decade, low-rank tensor techniques have been developed that aim at dealing with this curse of 
dimensionality by approximating F and X in a compressed format; see |18[ I20j for overviews. One 
approach consists of restricting @ to a subset M. C K"i><" 2 x---xnd compressed tensors: 


min /(X) :=^(X,AX)-(X,F). 


(5) 


Examples for Ad include the Tucker format [33131], the tensor train (TT) format (SD), the matrix 
product states (MPS) format |3] or the hierarchical Tucker format [T3[ll]- Assuming that the 
corresponding ranks are fixed. Ad is a smooth embedded submanifold of K"iX" 2 x--xnd each 
of these formats |25l EEl IMIEI]. This property does not hold for the CP format, which we will 
therefore not consider. 

When Ad is a manifold, Riemannian optimization techniques [T] can be used to address ([^ . In a 
related context, first-order methods, such as Riemannian steepest descent and nonlinear CG, have 
been successfully applied to matrix completion li 1331113IM] and tensor completion [13133 E3ISS]. 

Similar to Euclidean optimization, the condition number of the Riemannian Hessian of the ob¬ 
jective function is instrumental in predicting the performance of first-order optimization algorithms 
on manifolds; see, e.g., 


Thm. 2] and [3 Thm. 4.5.6]. As will be evident from (28) in (4.1 


an 

ill-conditioned operator A can be expected to yield an ill-conditioned Riemannian Hessian. As this 
is the case for the applications we consider, any naive first-order method will be prohibitively slow 
and noncompetitive with existing methods. 

Eor Euclidean optimization, it is well known that preconditioning or, equivalently, adapting 
the underlying metric can be used to address the slow convergence of such first-order methods. 
Combining steepest descent with the Hessian as a (variable) preconditioner yields the Newton 
method with (local) second order convergence [331 Sec. 1.3.1]. To overcome the high computational 
cost associated with Newton’s method, several approximate Newton methods exist that use cheaper 
second-order models. For example, Gauss-Newton is a particularly popular approximation when 
solving non-linear least-squares problems. For Riemannian optimization, the connection between 
preconditioning and adapting the metric is less immediate and we explore both directions to speed 
up first-order methods. On the one hand, we will consider a rather ad hoc way to precondition 
the Riemannian gradient direction. On the other hand, we will consider an approximate Newton 
method that can be interpreted as a constrained Gauss-Newton method. This requires setting up 
and solving linear systems with the Riemannian Hessian or an approximation thereof. In |62j . it 
was shown that neglecting curvature terms in the Riemannian Hessian leads to an efficient low- 
rank solver for Lyapunov matrix equations. We will extend these developments to more general 
equations with tensors approximated in the Tucker and the TT formats. 

Riemannian optimization is by no means the only sensible approach to finding low-rank tensor 
approximations to the solution of the linear system ([^. For linear operators only involving the 
Laplace-like operator (§, exponential sum approximations [niEi] and tensorized Krylov subspace 
methods |35j are effective and allow for a thorough convergence analysis. For more general equations, 
a straightforward approach is to apply standard iterative methods, such as the Richardson iteration 
or the GG method, to ([^ and represent all iterates in the low-rank tensor format; see [i [131 [13 
1291136j for examples. One critical issue in this approach is to strike a balance between maintaining 
convergence and avoiding excessive intermediate rank growth of the iterates. Only recently, this 
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has been analyzed in more detail [ 3 . A very different approach consists of applying alternating 
optimization techniques to the constrained optimization problem ([^. Such methods have originated 
in quantum physics, most notably the so called DMRG method to address eigenvalue problems in 
the context of strongly correlated quantum lattice systems, see [53] for an overview. The ideas 
of DMRG and related methods have been extended to linear systems in the numerical analysis 
community in muniiMiiin] and are generally referred to as alternating linear schemes (ALS). 
While such methods often exhibit fast convergence, especially for operators of the form ([^, their 
global convergence properties are poorly understood. Even the existing local convergence results for 
ALS |52l |59| offer little intuition on the convergence rate. The efficient implementation of ALS for 
low-rank tensor formats can be a challenge. In the presence of larger ranks, the (dense) subproblems 
that need to be solved in every step of ALS are large and tend to be ill-conditioned. In IMlISZj, 
this issue has been addressed by combining an iterative solver with a preconditioner tailored to the 
subproblem. The design of such a preconditioner is by no means simple, even the knowledge of an 
effective preconditioner for the full-space problem Q is generally not sufficient. So far, the only 
known effective preconditioners are based on exponential sum approximations for operators with 
Laplace-like structure (§, which is inherited by the subproblems. 

Compared to existing approaches, the preconditioned low-rank Riemannian optimization meth¬ 
ods proposed in this paper have a number of advantages. Due to imposing the manifold constraint, 
the issue of rank growth is completely avoided. Our methods have a global nature, all components 
of the low-rank tensor format are improved at once and hence the stagnation typically observed 
during ALS sweeps is avoided. Moreover, we completely avoid the need for solving subproblems 
very accurately. One of our methods can make use of preconditioners for the full-space problem 
while for the other methods preconditioners are implicitly obtained from approximating the Rie¬ 
mannian Hessian. A disadvantage shared with existing methods, our method strongly relies on the 
decomposition ([^ of the operator to construct effective preconditioners. 

In passing, we mention that there is another notion of preconditioning for Riemannian optimiza¬ 
tion on low-rank matrix manifold, see, e.g., mSilllT]. These techniques address the ill-conditioning 
of the manifold parametrization, an aspect that is not related and relevant to our developments, as 
we do not directly work with the parametrization. 

The rest of this paper is structured as follows. In Section we briefly review the Tucker 
and TT tensor formats and the structure of the corresponding manifolds. Then, in Section 
a Riemannian variant of the preconditioned Richardson method is introduced. In Section [^ we 
incorporate second-order information using an approximation of the Riemannian Hessian of the 
cost function and solving the corresponding Newton equation. Finally, numerical experiments 
comparing the proposed algorithms with existing approaches are presented in Section 


2 Manifolds of low-rank tensors 

In this section, we discuss two different representations for tensors X G ]^niXTi 2 x...xnd^ namely 
the Tucker and tensor train/matrix product states (TT/MPS) formats, along with their associated 
notions of low-rank structure and their geometry. We will only mention the main results here and 
refer to the articles by Kolda and Bader |3T| and by Oseledets [50] for more details. More elaborate 
discussions on the manifold structure and computational efficiency considerations can be found in 
[SnilM] for the Tucker format and in [3ni[5Sl[^ for the TT format, respectively. 

2.1 Tucker format 

Format. The multilinear rank of a tensor X G ]jnixn 2 x---xnd jg defined as the d-tuple 
rankML(X) = (ri,r 2 ,... ,rd) = (rank(X(i)),rank(X( 2 )),...,rank(X(d))) 

with 

X(^) g /r=I,...,d, 

the /rth matricization of X; see |31j for more details. 
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Any tensor X G M^i^^2x---xnd multilinear rank r = (ri, r2, - ■ -, Vd) can be represented as 

ri rd-i 

X(zi,..., id^ — E-- E s(ji, j 2 , ■ ■ ■ ,jd)Ui{ii,ji)U 2 ii 2 ,j 3 ) ■ ■ ■ Udiid-I,jd), ( 6 ) 

ji=i jd-i=i 

for some core tensor S € and factor matrices C/^ S , /r = 1, ..., d. In the following, 

we always choose the factor matrices to have orthonormal columns, C/JC/^ = 

Using the /ith mode product x^, see m , one can write (§ more compactly as 

X = Sxit/i X2C/2 ••• XdC/d. ( 7 ) 

Manifold structure. It is known [ 30 l [201 [ 11 ] that the set of tensors having multilinear rank r 
forms a smooth submanifold embedded in lR"ixn2x---xnd_ This manifold Mr is of dimension 

d—1 d 

dimMr = n + E ~ 

fl — 1 fl — 1 

For X e Mr represented as in any tangent vector ^ S TxMr can be written as 
f, = Sx,SU, X2U2 ■■■XdUd + SxiUi X26U2 ■■■XdUd 

+ ■ ■ ■ + S Xi Ui X2 U2 ■ ■ ■ Xd SUd + SS Xi Ui X2 U2 ■ ■ ■ Xd Ud, 

for some first-order variations dS G U’'i-X---xrd g This representation of tangent 

vectors allows us to decompose the tangent space TxMr orthogonally as 

TxM = Vi © V 2 © ■ ■ ■ © Vd © Vd+i, with T Vi, \/fi ^ v, (9) 

where the subspaces are given by 

d 

V^ = |sx^dU^ X ( 5 C/JC/^ = o|, = ( 10 ) 

*■ i/=i 


and 

Vd+i = jds X U,-. 

In particular, this decomposition shows that, given the core tensor S and factor matrices C/^ of X, 
the tangent vector ^ is uniquely represented in terms of dS and gauged SU^ . 


Projectiou outo taugeut space. Given Z G R^ix-'-xn^^ components and dS of the 
orthogonal projection ^ are given by (see m Eq.(2.7)]) 


ds = z X c/y 

/i=i 


su^ = (4^ - t/^c/y z X u; 

12=1 


s'! 

ii)^y 


(11) 


where ^ is the Moore-Penrose pseudo-inverse of S(^). The projection of a 

Tucker tensor of multilinear rank f into TxMr can be performed in 0{dnfr‘^~^ + r‘^r) operations, 
where we set f := max^ r^, r := max^ and f > r. 
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2.2 Representation in the TT format 

Format. The TT format is (implicitly) based on matricizations that merge the first fi modes into 
row indices and the remaining indices into column indices: 

X<M> g /i = 1, . . . , d - 1. 

The TT rank of X is the tuple rankTT(X) := (tq, ri,..., r^) with = rank(X^^^). By definition, 
ro = rd = l ■ 

A tensor X G K"ixn 2 x---xnd rprp j, _ ^ admits the representation 


X(ii ,... ,id) = Ui{ii)U2{i2) ■ ■ • Ud{id) 


( 12 ) 


where each is a matrix of size r^_i x for = 1, 2 ,..., n 

= 1, 2,..., into third-order tensors of size r^_i x 
we can also write (12) as 


By stacking the matrices 
X r^, the so-called TT cores, 


ri 'Td-i 

X(zi ,..., id) E- E u i(1A1, jl)U2(j2,i2,j3) • • •Ud(jd_l,ld,l)- 

ii=i jd_i=i 


To access and manipulate individual cores, it is useful to introduce the interface matrices 

x<^ = [[/l(h)c/ 2 (^ 2 ) • • • e 
x>^ = [[/^(^^)C/^+l(v+l) • • • Ud{id)V G 

and 

X^^ = X>^+i (g) (g X<^_i G (^ 3 ) 

In particular, this allows us to pull out the /xth core as vec(X) = Xg^ vec(U^), where vec(-) denotes 
the vectorization of a tensor. 

There is some freedom in choosing the cores in the representation ( |12[ ). In particular, we can 
orthogonalize parts of X. We say that X is ^-orthogonal if X^j^X<y = 1,.^ for all = 1,..., /i — 1 
and X>i^X>^, = Ir„_i for all i/ = fj, 1,... ,d, see, e.g., [55] for more details. 


Manifold structure. The set of tensors having Hxed TT rank, 

Mr = {X G : rankTT(X) = r} , 

forms a smooth embedded submanifold of K”iX'"Xnd^ ggg EiliniiMi, of dimension 

d d—1 

dim Mr = “ E 

fl—1 IT — 1 

Similar to the Tucker format, the tangent space TxMr at X G Mr admits an orthogonal decom¬ 
position: 

TxMr = Vi © V 2 © ■ ■ ■ © Vd, with T Vi, W fj, ^ v. (14) 

Assuming that X is d-orthogonal, the subspaces can be represented as 

V^ = {Xg^vec(^U^):( 5 U^GM’'-^""-^^^(UL)^WL = 0 }, fi = 1 ,... ,d - 1 , 

Vd = {X^dvec(^Ud): 5Ud G . 

Here, = V<^> g is called the left unfolding of and it has orthonormal columns 

for /j, = 1,..., d — 1, due to the d-orthogonality of X. The gauge conditions (Uj;;)^dUj^ = 0 for 
gi d ensure the mutual orthogonality of the subspaces and thus yield a unique representation 
of a tangent vector ^ in terms of gauged (5X1;^. Hence, we can write any tangent vector f G TxMr 
in the convenient form 

d 

^ = vec(dU^) G s.t. (UL)T<5UL =0, V/r ^ d. (16) 

fl^l 
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Projection onto tangent space. The orthogonal projection PtxA^ onto the tangent space TxAt 
can be decomposed in accordance with (14): 


Ptx>1 =P'+P' + --- + P^ 

where are orthogonal projections onto Let X G A4r be d-orthogonal and Z G 
Then the projection can be written as 


d 

PTx>t.(Z) = ^P'^(Z) where P^(Z) = vec((5U^). 

For ^ = 1,..., d — 1, the components in this expression are given by [40l p. 924] 


-pL)(4 


.X5^_OZ<'^>X>^+i(x5^+iX>^+i 


, -1 


with P^ = the orthogonal projector onto the range of Uj;. For /i = d, we have 


= (/„ 


X^,_i)Z 


<d> 


(17) 


(18) 


(19) 


The projection of a tensor of TT rank f into T^Air can be performed in 0{dnrf‘^) operations, 


where we set again f := max^ r := max^ and f > r. 


it 


Remark 1. Equation (18) is not well-suited for numerical calculations due to the presence of the 
inverse of the Gram matna; Xy_|_j^X>^+i, which is typically severely ill-conditioned. In 1281 T 
was shown that by p-orthogonalizing the fith summand of the tangent vector representation, these 
inverses can be avoided at no extra costs. To keep the notation short, we do not include this 
individual orthogonalization in the equations above, but make use of it in the implementation of the 
algorithm and the numerical experiments discussed in Section^ 


2.3 Retractions 

Riemannian optimization algorithms produce search directions that are contained in the tangent 
space Tx_Air of the current iterate. To obtain the next iterate on the manifold, tangent vectors are 
mapped back to the manifold by application of a retraction map R that satisfies certain properties; 
see [21 Def. 1] for a formal definition. 

It has been shown in [22] that the higher-order SVD (HOSVD) [T^ . which aims at approximating 
a given tensor of rank f by a tensor of lower rank r, constitutes a retraction on the Tucker manifold 
A4r that can be computed efficiently in 0{dnf'^ operations. For the TT manifold, we will use 
the analogous TT-SVD (SOI Sec. 3] for a retraction with a computational cost of 0(dri®), see [55] . 
For both manifolds, we will denote by i?(X -|- ^) the retractioiP of ^ G TxAir that is computed by 
the HOSVD/TT-SVD of X -b 


3 First-order Riemannian optimization and preconditioning 

In this section, we discuss ways to incorporate preconditioners into simple first-order Riemannian 
optimization methods. 

3.1 Riemannian gradient descent 

To derive a first-order optimization method on a manifold Air, we first need to construct the 
Riemannian gradient. For the cost function ([^ associated with linear systems, the Euclidean 
gradient is given by 

V/(X) = AX-F. 

^Note that the domain of definition of R is the affine tangent space X -|- TxAlr, which departs from the usual 
convention to define R on TxAIr and only makes sense for this particular type of retraction. 
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For both the Tucker and the TT formats, A^r is an embedded submanifold of and hence 

the Riemannian gradient can be obtained by projecting V/ onto the tangent space: 

grad/(X)=PTx>t.(-4X-F). 

Together with the retraction R of Section |2.3[ this yields the basic Riemannian gradient descent 
algorithm: 

Xfc+i = i?(X/c + with V/(X/c). (20) 

As usual, a suitable step size ak is obtained by standard Armijo backtracking linesearch. Follow¬ 
ing |61j . a good initial guess for the backtracking procedure is constructed by an exact linearized 
linesearch on the tangent space alone (that is, by neglecting the retraction): 

• ii-v , ^ ^ (^fc>V/(Xfe)) 

argmm/(Xfe -f a^k) = 


3.2 Truncated preconditioned Richardson iteration 


Truncated Richardson iteration. The Riemannian gradient descent dehned by (201 closely 
resembles a truncated Richardson iteration for solving linear systems: 


Xfc+i = i?(Xfc -I- with = — V/(Xfc) = F — AX/j 


( 22 ) 


which was proposed for the CP tensor format in [29] . For the hierarchical Tucker format, a variant 


of the TT format, the iteration (22) has been analyzed in |^. In contrast to manifold optimization, 
the rank does not need to be fixed but can be adjusted to strike a balance between low rank and 
convergence speed. It has been observed, for example in [35], that such an iterate-and-truncate 
strategy greatly benefits from preconditioners, not only to attain an acceptable convergence speed 
but also to avoid excessive rank growth of the intermediate iterates. 


Preconditioned Richardson iteration. For the standard Richardson iteration X^+i = X^ — 
Oik^k, a symmetric positive definite preconditioner V for A can be incorporated as follows: 

Xk+i = Xk + akT’~^^k with ^fc=F —AXfc. (23) 

Using the Cholesky factorization P = CC^, this iteration turns out to be equivalent to applying the 
Richardson iteration to the transformed symmetric positive definite linear system 

C-^AC~~^Y = C-^F 


after changing coordinates by C Xk- At the same time, (23) can be viewed as applying gradient 
descent in the inner product induced by V. 


Truncated preconditioned Richardson iteration. The most natural way of combining trun¬ 
cation with preconditioning leads to the truncated preconditioned Richardson iteration 


Xk+i = R{Xk + akV ^Cfe), with ^k=F-AX, 


k-) 


(24) 


see also [29j . In view of Riemannian gradient descent (20), it appears natural to project the search 


direction to the tangent space, leading to the “geometric” variant 

Xk+i = R{Xk + akPT:x.^Mr'P~^^k), with = F — AX/; 


(25) 


In terms of convergence, we have observed that the scheme (25) behaves similar to (24); see 


§5.3[ However, it can be considerably cheaper per iteration: Since only tangent vectors need to be 
retracted in (25), the computation of the HOSVD/TT-SVD in R involves only tensors of bounded 
rank, regardless of the rank of 'P~^^k- In particular, with r the Tucker/TT rank of X^, the 
corresponding rank of Xk + ak PTx^^Air 'P~^^k is at most 2r; see |34l §3.3] and [55l Prop. 3.1]. On 
the other hand, in (24) the rank of X^, -|- ak'P~^^k is determined primarily by the quality of the 


preconditioner V and can possibly be very large. 
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Another advantage occurs for the special but important case when 'P~^ = where 

each term Va is relatively cheap to apply. For example, when 'P~^ is an exponential sum precondi¬ 
tioner |10j then s = d and Va is a Kronecker product of small matrices. By the linearity of PTx^A 4 r 5 
we have 

S 

PTxfcXr ^ PTx^Mr 'Pa^k, ( 26 ) 

a—1 


which makes it often cheaper to evaluate this expression in the iteration (25). To see this, for 
example, for the TT format, suppose that Va^ has TT ranks r^. Then the preconditioned direction 
V~^^k can be expected to have TT ranks sVp. Hence, the straightforward application of 
to V~^^k requires 0{dn{srp)‘^r) operations. Using the expression on the right-hand side of (26) 
instead reduces the cost to 0{dnsr‘^r) operations, since the summation of tangent vectors amounts 
to simply adding their parametrizations. In contrast, since the retraction is a non-linear operation, 
trying to achieve similar cost savings in (24) by simply truncating the culmulated sum subsequently 
may lead to severe cancellation [351 §6.3]. 


4 Riemannian optimization using a quadratic model 

As we will see in the numerical experiments in Section]^ the convergence of the first-order methods 
presented above crucially depends on the availability of a good preconditioner for the full problem. 
In this section, we present Riemannian optimization methods based on a quadratic model. In these 
methods, the preconditioners are derived from an approximation of the Riemannian Hessian. 


4.1 Approximate Newton method 

The Riemannian Newton method [T] applied to ([^ determines the search direction from the 
equation 

Hx.6 = -PTx>t.V/(Xfe), (27) 

where the symmetric linear operator "Hx*, : Tx^AIr —t Tx^-^r is the Riemannian Hessian of ([^. 
Using |2], we have 


Hx. = Ptx,a.. [VV(Xfc) + Jx,VV(Xfc)] Ptx,A4. 
= Ptx. Mr [A + JXfc (AXfe - F)] Ptx. Mr 


(28) 


with the Frechet derivativ^ Jx^ of PTx^Mr- 

As usual, the Newton equation is only well-defined near a strict local minimizer and solving 
it exactly is prohibitively expensive in a large-scale setting. We therefore approximate the linear 
system (27) in two steps: First, we drop the term containing Jx^, and second, we replace A = L-\-V 


by £. The first approximation can be interpreted as neglecting the curvature of Air> or equivalently, 
as linearizing the manifold at X^. Indeed, this term is void if Air would be a (flat) linear subspace. 
This approximation is also known as constrained Gauss-Newton (see, e.g, 0) since it replaces the 
constraint X S Air with its linearization X S TxAIr and neglects the constraints in the Lagrangian. 
The second approximation is natural given the assumption of £ being a good preconditioner for 
A = C -\-V ■ In addition, our derivations and numerical implementation will rely extensively on the 
fact that the Laplacian £ acts on each tensor dimension separately. 

The result is an approximate Newton method were the search direction is determined from 


P^Xj^Mr -^PTx^Mr Cfc = PTxMr(P “ AXfc ) . (29) 

Since £ is positive definite, this equation is always well-defined for any Xj,. In addition, is also 
gradient-related and hence the iteration 

Xfc+i = i?(Xfc -I- ak^k) 

is guaranteed to converge globally to a stationary point of the cost function if au is determined 
from Armijo backtracking [T]. 


^ Jx^ is an operator from R’*X"X - X" to the space of self-adjoint linear operators Tx^Mr —t Tx^Atr- 







Despite all the simplifications, the numerical solution of (29) turns out to be a nontrivial task. 
In the following section, we explain an efficient algorithm for solving (29) exactly when Air is the 


Tucker manifold. For the TT manifold, this approach is no longer feasible and we therefore present 


an effective preconditioner that can used for solving (29) with the preconditioned CG method. 


4.2 The approximate Riemannian Hessian in the Tucker case 


The solution of the linear system ( [2^ was addressed for the matrix case {d = 2) in [62l Sec. 7.2]. 
In the following, we extend this approach to tensors in the Tucker format. To keep the presentation 
concise, we restrict ourselves to d = 3; the extension to d > 3 is straightforward. 


For tensors of order 3 in the Tucker format, we write (29) as follows: 




(30) 


where 


• X S Air is parametrized by factor matrices Ui,U 2 ,U 3 having orthonormal columns and the 
core tensor S; 


• the right-hand side n € TvAIr is given in terms of its qauqed parametrization SUT.SUn.SUS 
and dS'', as in (||) and ([^; 


• the unknown S TxAIr is to be determined in terms of its gauged parametrization SUi, dC/ 2 , SU 3 
and dS, again as in (fsl) and (10). 


To derive equations for 5Ufi with ^ = 1,2,3 and dS we exploit that TxAIr decomposes orthogo¬ 
nally into Vi © ■ ■ ■ 0 V 4 ; see (§. This allows us to split ( [^ into a system of four coupled equations 
by projecting onto for /r = 1,..., 4. 

In particular, since ^ € TxAIr by assumption, we can insert Z := TPtxAIp'? ~ ( [TT| ). 

By exploiting the structure of L (see ([^) and the orthogonality of the gauged representation of 
tangent vectors (see we can simplify the expressions considerably and arrive at the equations 


= Pci + Li 5 UiS(^) + dC/i 5 (i) [Ir, ® UJL2U2 + UJL3U3 ® Ir,])sl^ 

6U2 = (yL2U2dS(2) + L 25 U 2 S( 2 ) + dU2S(2) [Aa ® Uj LiUi + uj L3U3 ® In] ^ 

SUS = {l3U3SS^3) + T 3 dt/ 35 ( 3 ) + SUsS^s) [Ir, ® UjLiUi + UJL 2 U 2 ® 

dS’' = [UjLiUiSS^i) + C/TTidC/i5(i)]^^^ + [uJL 2U2SS^2) + Uj L2SU2S^2)]^^'’ 

+ [UjL3U3SS^3)+UjL3SU3S^3)f\ 


Additionally, the gauge conditions need to be satisfied: 

Uj 6 Ui = UJSU 2 = UJ 6 U 3 = 0. 


(32) 


In order to solve these equations, we will use the first three equations of (31), together with (32) 


to substitute dC/^ in the last equation of (31) and determine a decoupled equation for dS. Rear¬ 
ranging the first equation of , we obtain 

P^^ (TidDi + dC/i^(i) [Ir, ® UJL 2 U 2 + UJL 3 U 3 ® In] = SU^ - P^^ LiUiSS^ 3 )Sly 


Vectorization and adhering to (p^ yields the saddle point system 


G 

In ® Ui 


vec(d[/i) 


h' 

In © Uj 

0 




0 


where 


(33) 


G — Ir-i © Ti + (S’jj^^)''' (/r 3 © uj L 2 U 2 + uj L 3 U 3 © Ir 2 )sji'f ® Ini, 
h = veciSU^) - ((Af,/ © P^^ LiC/i) vec(d5(i)). 
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and 2/1 G M’'i is the dual variable. The positive definiteness of Li and the full rank conditions on 
Ui and S imply that the above system is nonsingular; see, e.g., [?]• Using the Schur complement 
Gs = —{In ® Ui)~'^G~^{Iri (8) Ui), we obtain the explicit expression 


Yec{6Ui) = (^Inm + G ^(/ri (8 Ui)Gg^(/ri (8 C/7 ))g = Wi - Fi vec((5S'(i)), 

with 

Wi := + G-\lr, 8 Ui)Gs\lr., 8 UJ))g-^ vec{SU^), 

Fi := +G-1(/,,8Gi)G51(4,8C/T))g-i((5J,))T^P^^2.iC/i). 


Expressions analogous to (34) can be derived for the other two factor matrices: 


Yec{SU2) =W2-F2 vec(^S'(2)), 
vecidUs) =W3- F 3 vec((5S'(3)), 


(34) 


with suitable analogs for W 2 ,W 3 , F 2 , and F 3 . These expressions are now inserted into the last 
equation of (31) for 6 S'^. To this end, define permutation matrices that map the vectorization 
of the ith matricization to the vectorization of the jth matricization: 


vec(5S(i)) = vec((5S(j)), 

By definition, vec(^5'(i)) = vec(5S), and we finally obtain the following linear system for vec((5S): 

F vec{ 6 S) = vec{ 6 S'^) - 8 UjLi)wi - 8 UjL 2 )w 2 - n 3 ^i(S'^) 8 Uj ^ 3 )^ 3 , (35) 

with the rir 2 r 3 x rir 2 r 3 matrix 


F := 8 UjLiU^ - {Sj^) 8 U^LOFi + n2^i ® UjL 2 U 2 - (5^2) ® L 2 )F 2 


n 


l -)-2 


u. 


3^1 


UJL 3 U 3 - (^(T) 8 UJL3)F3 


n 


l-i-3- 


For small ranks, the linear system ( |35[ ) is solved by forming the matrix F explicitly and using a 
direct solver. Since this requires 0^r|r|) operations, it is advisable to use an iterative solver 
for larger ranks, in which the Kronecker product structure can be exploited when applying F; see 
also I 


Once we have obtained (5S, we can easily obtain SUi, 61 / 2 , 61/3 using (34). 


Remark 2. The application of G ^ needed in (34) as well as in the construction of Gs can 


be implemented efficiently by noting that G is the matrix representation of the Sylvester operator 
V I—)■ LiV + Ur7, with the matrix 


Ti := (^fi)) 


'{Ir: 




U 2 


U 3 L 3 U 3 


)/y 


The ri x ri matrix Ti is non-symmetric but it can be diagonalized by first computing a QR decom¬ 
position = QsRs such that Q"sQs = Iri and then computing the spectral decomposition of the 
symmetric matrix 

Ql{lr, 8 UJL 2 U 2 + UJL 3 U 3 8 In)Qs- 

After diagonalization o/Ti, the application of G~^ requires the solution ofri linear systems with the 
matrices Li + A/, where X is an eigenvalue ofTi; see also The Schur complement Gs G 
is constructed explicitly by applying G~^ to the rf columns of 8 Ui. 

Analogous techniques apply to the computation ofw 2 ,F 2 , and ^ 3 ,^ 3 . 

Assuming, for example, that each is a tri-diagonal matrix, the solution of a linear system 
with the shifted matrix XI can be performed in 0{n) operations. Therefore, using Remark]^ 
the construction of the Schur complement Gs requires 0{nr^) operations. Hence, the approximate 
Newton equation (30) can be solved in 0{dnr^ -\-r^) operations. This cost dominates the complexity 


of the Riemannian gradient calculation and the retraction step. 
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4.3 The approximate Riemannian Hessian in the TT case 

When using the TT format, it seems to be much harder to solve the approximate Newton equa¬ 
tion (j^ directly and we therefore resort to the preconditioned conjugate gradient (PCG) method 
for solving the linear system iteratively. We use the following commonly used stopping criterion [481 
Ch. 7.1] for accepting the approximation ^ produced by PCG: 


I PTx,Ar.[/:C - V/(Xfc)]|| < min 0.5, ,/|| Ptx,^. V/(Xfc 


Ptx„^.V/(X,)||. 


To derive an effective preconditioner for PCG, we first examine the approximate Newton equa¬ 
tion (291 more closely. For d-dimensional tensors in the TT format, it takes the form 


PTxAIr ^ PTxVVtr C = 


(36) 


where 

• X S A4r is parametrized by its cores Ui, U 2 ,..., Ud and is d-orthogonal; 

• the right-hand side ri G is represented in terms of its gauged parametrization dU?, 

dU’', ..., dU’j, as in ([^; 

• the unknown ^ G TyAir needs to be determined in terms of its gauged parametrization 
dUi, i 5U2, ..., (5Ud, again as in (16). 


When PCG is applied to (36) with a preconditioner B: TyAir TxAir, we need to evaluate 
an expression of the form ^ = Bg for a given, arbitrary vector ry G TxAlr- Again, ^ and ry are 
represented using the gauged parametrization above. 

We will present two block Jacobi preconditioners for (36); both are variants of parallel subspace 
correction (PSC) methods [SS]. They mainly differ in the way the tangent space TxAJr is split into 
subspaces. 

4 . 3.1 A block diagonal Jacobi preconditioner 


The most immediate choice for splitting TxAir is to simply take the direct sum (14). The PSC 
method is then defined in terms of the local operators 




/:^ = p'^£pn 


v„ ’ 


yi = l,...,d. 


where P^ is the orthogonal projector onto see (2.2 The operators are symmetric and 
positive definite, and hence invertible, on V;_j. This allows us to express the resulting preconditioner 
as [M] §3.2] 


d d 

B = J2£~^P>^ = Y,[p^£P^\^^y P^. 






The action of the preconditioner ^ = Br] can thus be computed as ^ with 


C^=(p^/:P^IvJ 'P^ry, t, = l,...,d. 


Local problems. The local equations determining 
P^£P>^^^ = P>^rj, 


/r = l,...,d, 


(37) 


can be solved for all G in parallel. By (15), we have = X^^ vec((5U^) for some gauged 
Since P^Ty satisfies an expansion analogous to (16), straightforward properties of the projectors P^ 
allow us to write (37) as 

P'^ £X^^ vec(<5U^) = X^^ vec(JU;]), fi=l,...,d, 
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under the additional constraint = 0 when fj, ^ d. Now expressing the result of 

applied to vec((5U^) as in 0 and using ( [I^ leads to 

= (5U;i)L (38) 

for fj, ^ d, while 0 for /i = d leads to the equation 

T r n <d> 

/:X^dvec(dUrf) =(<5U2)K (39) 


Using (13), the application of the Laplace-like operator C to X^^ can be decomposed into three 
parts, 

/IX^^ = 0 In^ ® X<^_i -I- X>^_|_i 0 L;_j 0 X<^_i + X>^+i 0 0 (40) 

with the reduced leading and trailing terms 


/A*-! 

•^<At-l = I In^-i ( 

\!y=l 
/ ^ 

= I I-rid 

\i/=/i+l 






)'>U 


vT 


Some manipulatior0 establishes the identity 

[£X^^ vec(dU^)] = (J„^ 0 X<^_i) 0 X<^_i + 0 £<^_i 

Inserting this expression into ( [38| ) yields for /r d 

- P^) [WL£5^+,X>^+i (x5^+iX>^+i) 

+ (£m ® ® X5^_i£<^_i) dU^] = (du;()K 

After defining the (symmetric positive definite) matrices £<^_i = X<^_j^£<^_i and £>/i+i = 
X>^_i_]^£>^+i, we finally obtain 

- P|l) [dUL£>^+i(Xj^+iX>^+i)”' + ® + In, ® = (dU;()S (41) 


with the gauge condition (dUj^)^U[^ = 0. For fi = d, there is no gauge condition and (39) becomes 

dUL + (L, 0 Ir, + 0 £<d_i) dUL = (dU2)L. (42) 

Efficient solution of local problems. The derivations above have led us to the linear sys¬ 
tems (41) and (42) for determining the local component While (42) is a Sylvester equation 
and can be solved with standard techniques, more work is needed to address (41) efficiently. Since 
£>^+1 and X>^^j^X>^+i are symmetric positive definite, they admit a generalized eigenvalue de¬ 
composition: There is an invertible matrix Q such that C>^+iQ = (X^^+iX>^+i)(3A with A 
diagonal and (5^(X>^_,_^X>^+i)(5 = Ir,- This transforms ( [4l] ) into 

- P^l) [5 ULqTa + (L^ 0 0 £<^_i) dU^QT] ^ 

Setting dU^ = and (dU^)*- = (dU)()'-(5^, we can formulate these equations column-wise: 

+£^0 0£<^_i] dUL(:,*) = (dU2)^(:,i), (43) 


^This is shown by applying the relation ® X<^_i which holds for any TT tensor 1391 

eq. (2.4)], to vec((5U^). 
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where Xi = A{i,i) > 0. Because Q is invertible, the gauge-conditions on are equivalent to 
= 0. Combined with (431, we obtain - similar to (33) - the saddle point systems 


■ UL- 

■<5UL (:,*)■ 


■TO)H:,*)' 

.(Uli)^ 0 _ 

y 


0 


with the symmetric positive definite matrix 


(44) 


(45) 


and the dual variable y S The system (44) is solved for each column of i5U)^: 


<5UL(:,*) = (/, 


(UL)T)G;j (W;()L(:,*), 


using the Schur complement Gs '■= —(Uj;)^G^Transforming back eventually yields = 

ajLQ-T 


Remark 3. Analogous to Remark 
defined in (45) represents the 


\si the application of G^ \ 
Sylvester operator 


benefits from the fact that the matrix 


V [L^ XiIn^)V + 


After diagonalization o/£<^_i, the application of G^\ requires the solution of r^ linear systems 
with the matrices + {Xi + where j3 is an eigenvalue o/£<^_i. The Schur complements 

Gs S are constructed explicitly by applying G“) to the r^ columns ofXJ^. 

Assuming again that solving with the shifted matrices Lfj,+{Xi+/3)In^ can be performed in 0{n^) 
operations, the construction of the Schur complement Gg needs 0{n^r‘jf) operations. Repeating 
this for all columns of and all cores p, = 1,..., d — 1 yields a total computational complexity 
of 0{dnr^) for applying the block-Jacobi preconditioner. 


4.3.2 An overlapping block-Jacobi preconditioner 

The block diagonal preconditioner discussed above is computationally expensive due to the need 
for solving the saddle point systems (44). To avoid them, we will construct a PSC preconditioner 
for the subspaces 


:= {X^^vec(JU^): (5U^ S 


pr._i x-Rm XT, 


} = spanX^^, 


p = l,...,d. 


Observe that Vf^ C for p d. Hence, the decomposition TxMr = is no longer a direct 

sum as in (14). The advantage of over V^, however, is that the orthogonal projector onto 
is considerably easier. In particular, since X is d-orthogonal, we obtain 


P'^ = X^^(XT^X^^)-1XT^ = X^^ [(X5^+iX>^+i)-i 0 Xj^. 


(46) 


The PSC preconditioner corresponding to the subspaces V^j, is given by 




^=1 


-1 


pM 


The action of the preconditioner ^ = Br] can thus be compnted as f "^i^h 


(47) 
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Local problems. To solve the local equations (471, we proceed as in the previous section, but the 
resulting equations will be considerably simpler. Let P^rj = X^^vec((5U^) for some which 

will generally differ from the gauged (5U^ parametrization of r]. Writing vec((5U^), we 

obtain the linear systems 


P'^/:X^^vec(5U^) = X^^vec(<5U;') 


for p, = 1,..., d. Plugging in (461 gives 


[(X^^+iX>^+i)-i 0 XT^/:X^^ vec(dU^) = vec(dU;i). 


(48) 


Analogous to (40), we can write 


with the left and right parts 


f'tj.-i 


^ d 

E 


’ ^ni I 




< (g) . . . Irif^+i I 


Vi^=/i+l 


Again, it is not hard to show that 

(xT^£X^^ vec(dU^))<'^> = dVlC>^+, + {L^ 0 0 

Hence, ( [4^ can be written as 

( 5 uL/:>^+i(x^^+iX>^+i)-' + (L^ 0 Ir^_, + 0 /:<^-i)duL = (< 5 u;()K (49) 


Efficient solution of local problems. The above equations can be directly solved as follows: 


Using the generalized eigendecomposition of £>^+iQ = (X>^_,_^X>^+i)(5A, we can write (491 
column-wise as 




with the system matrix 


^® 4 “ £fi ® 4 “ ^ 

and the transformed variables dUj;; := and (dU)()'- := (dU)()'-(3^. Solving with 

can again be achieved by efficient solvers for Sylvester equations, see Remark After forming 
(5Uj^ = for all /r, we have obtained the solution as an ungauged parametrization: 


C = H77 = ^X^^vec(dU^). 


/i=i 


To obtain the gauged parametrization of ^ satisfying (16), we can simply apply (18) to compute 


PTxXr(C) and exploit that ^ is a TT tensor (with doubled TT ranks compared to X). 

Assuming again that solving with can be performed in O(n^) operations, we end up with 
a total computational complexity of O(dnr^) for applying the overlapping block-Jacobi precondi¬ 
tioner. Although this is the same asymptotic complexity as the non-overlapping scheme from §4.3.1[ 
the constant and computational time can be expected to be signihcantly lower thanks to not having 
to solve saddle point systems in each step. 

Remark 4. By fi-orthogonalizing X and transforming as described in m, the Gram matrix 

X>^_i_j^X>^_i_i in (41) and (49) becomes the identity matrix. This leads to a more stable calculation 
of the corresponding unknown see also Remark^ We make use of this transformation in our 

implementations. 
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4.3.3 Connection to ALS 


The overlapping block-Jacobi preconditioner B introduced above is closely related to ALS applied 
to Q. There are, however, crucial differences explaining why B is significantly cheaper per iteration 
than ALS. 

Using vec(X) = vec(U^), one micro-step of ALS fixes X^^ and replaces by the mini- 
mizer of (see, e.g., [53J Alg. 1]) 

min ^(X^^ vec(U^), AX^^ vec(U^)) - (X^^ vec(U^),vec(F)). 


After has been updated, ALS proceeds to the next core until all cores have eventually been 
updated in a particular order, for example, Ui,U 2 ,...,Ud. The solution to the above minimization 
problem is obtained from solving the ALS subproblem 

XT^AX^^vec(U^) = XT^vec(F). 


It is well-known that ALS can be seen as a block version of non-linear Gauss-Seidel. The 
subproblem typically needs to be computed iteratively since the system matrix Xj^AX^^U^ is 
often unmanageably large. 

When X is /r-orthogonal, X>^^j^X>^_|_i = and the ALS subproblem has the same form as 


the subproblem (481 in the overlapping block-Jacobi preconditioner B. However, there are crucial 
differences: 


ALS directly optimizes for the cores and as such uses A in the optimization problem. The 
approximate Newton method, on the other hand, updates (all) the cores using a search di¬ 
rection obtained from minimizing the quadratic model (29). It can therefore use any positive 


definite approximation of A to construct this model, which we choose as C. Since (48) is the 
preconditioner for this quadratic model, it uses C as well. 

ALS updates each core immediately and it is a block version of non-linear Gauss-Seidel for ([^, 
whereas B updates all the cores simultaneously resembling a block version of linear Jacobi. 


Even in the large-scale setting of ^ 10^, the subproblems (48) can be solved efficiently 


in closed form as long as -I- allows for efficient system solves, e.g., for tridiagonal 
L^. This is not possible in ALS where the subproblems have to be formulated with A and 
typically need to be solved iteratively using PCG. 


Remark 5. Instead of PSC, we experimented with a symmetrized version of a successive subspace 
correction (SSC) preconditioner, also known as a back and forth ALS sweep. However, the higher 
computational cost per iteration of SSC was not offset by a possibly improved convergence behavior. 


5 Numerical experiments 

In this section, we compare the performance of the different preconditioned optimization techniques 
discussed in this paper for two representative test cases. 

We have implemented all algorithms in Matlab. For the TT format, we have made use of the 
TTeMPS toolbox, see http://anchp.epfl.ch/TTeMPS. All numerical experiments and timings are 
performed on a 12 core Intel Xeon X5675, 3.07 Ghz, 192 GiB RAM using Matlab 2014a, running 
on Linux kernel 3.2.0-0. 

To simplify the discussion, we assume throughout this section that the tensor size and ranks are 
equal along all modes and therefore state them as scalar values: n = max^ and r = max^ r^. 

5.1 Test case 1: Newton potential 

As a standard example leading to a linear system of the form , we consider the partial differential 
equation 

—Au{x) -I- U(a;) = f{x), x G LI = (—10,10)'^, 
u{x) =0 X G dLl. 
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with the Laplace operator A, the Newton potential V{x) = ||a;||“^, and the source function /: —?> 
M. Equations of this type are used to describe the energy of a charged particle in an electrostatic 
potential. 

We discretize the domain by a uniform tensor grid with n'^ grid points and corresponding mesh 
width h. Then, by finite difference approximation on this tensor grid, we obtain a tensor equation 
of the type 0> where the linear operator A is the sum of the d-dimensional Laplace operator as 
in ([^ with tridiag(—1, 2, —1) S K"^", and the discretized Newton potential V. To create 

a low-rank representation of the Newton potential, V (x) is approximated by a rank 10 tensor V 
using exponential sums [19] . The application of .4 to a tensor X is given by 

ylX = AX -f V o X, 

where o denotes the Hadamard (element-wise) product. The application of this operator increases 
the ranks significantly: If X has rank r then 4lX has rank (2 -|- 10)r = 12r. 


5.2 Test case 2: Anisotropic Diffusion Equation 

As a second example, we consider the anisotropic diffusion equation 

— div(IlVM(x)) = f{x), X S n = (—10,10) 
u{x) = 0 X G dV,, 

with a tridiagonal diffusion matrix D — tridiag(a, 1, a) G The discretization on a uniform 

tensor grid with n'^ grid points and mesh width h yields a linear equation with system matrix 
A = L + V consisting of the potential term 


V = In® ■ ■ ■ ® In® B 2 ® 2aBi + In® ■■■ ® In® B 3 ® 2aB2 ® In + Bd® 2aBd-l ® In® ■ ■ ■ ® In, 


and the Laplace part L defined as in the previous example. The matrix ^ tridiag(—1, 0,1) G 
]R"xn represents the one-dimensional central finite difference matrix for the first derivative. 

The corresponding linear operator A acting on X G ]R"iX'"Xnd ^g^ri be represented as a TT 
operator of rank 3, with the cores given by 


and 


= [Li{ii,ji) 2aBi{ii,ji) /„i(ii,ji)]. 


Ad{id,jd) 


Ind iid, id) 
Bdiid, jd) , 
Bd{id, jd) 




Bf_,{ifd,jf,) 0 0 

Bfdiifd, jfi) 2aBp^(ifd, jfd) In^iitJ.,jtJ.) 


/r = 2,..., d — 1. 


In the Tucker format, this operator is also of rank 3. Given a tensor X in the representation ([^, 
the result Y = AX is explicitly given byY = GxiFi X 2 ■ ■ ■ XdVd with 


= [Ufd B^U^] G 

and core tensor G G K3'’iX "X3rd g block structure shown in Figure for the case d = 3. 



Figure 1: Structure of the core tensor G for the case d = 3 resulting from an application of the 
anisotropic diffusion operator. 

The rank of A increases linearly with the band width of the diffusion matrix D. For example, 
a pentadiagonal structure would yield an operator of rank 4. See also [13] for more general bounds 
in terms of certain properties of D. 
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5.3 Results for the Tucker format 


For tensors represented in the Tucker format we want to investigate the convergence of the trun¬ 
cated preconditioned Richardson (24) and its Riemannian variant (|25|), and compare them to the 


approximate Newton scheme discussed in §4.2[ Figure displays the obtained results for the first 
test case, the Newton potential, where we set d = 3, n = 100, and used multilinear ranks r = 15. 
Figure ^displays the results for the second test case, the anisotropic diffusion operator with a = ^, 
using tne same settings. In both cases, the right hand side is given by a random rank-1 Tucker 
tensor. To create a full space preconditioner for both Richardson approaches, we approximate the 
inverse Laplacian by an exponential sum of k € {5,7,10} terms. It can be clearly seen that the 
quality of the preconditioner has a strong influence on the convergence. For k = 5, convergence is 
extremely slow. Increasing k yields a drastic improvement on the convergence. 




Figure 2: Newton potential with d = 3. Comparison of truncated preconditioned Richardson, trun¬ 
cated Riemannian preconditioned Richardson, and the approximate Newton scheme when applied to 
the Newton potential in the Tucker format. For the Richardson iterations, exponential sum approx¬ 
imations with k G {5,7,10} terms are compared. Left: Relative residual as a function of iterations. 
Right: Relative residual as a function of time 




Figure 3: Anisotropic diffusion with d = 3. Comparison of truncated Preconditioned Richardson, 
truncated Riemannian preconditioned Richardson, and the approximate Newton scheme when ap¬ 
plied to the Newton potential in the Tucker format. For the Richardson iterations, exponential sum 
approximations with k G {5, 7,10} terms are compared. Left: Relative residual as a function of 
iterations. Right: Relative residual as a function of time 

With an accurate preconditioner, the truncated Richardson scheme converges fast with regard to 
the number of iterations, but suffers from very long computation times due to the exceedingly high 
intermediate ranks. In comparison, the Riemannian Richardson scheme yields similar convergence 
speed, but with significantly reduced computation time due to the additional projection into the 
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tangent space. The biggest saving in computational effort comes from relation (26) which allows 
us to avoid having to form the preconditioned residual — AX.k) explicitly, a quantity with 

very high rank. Note that for both Richardson approaches, it is necessary to round the Euclidean 
gradient to lower rank using a tolerance of, say, 10“^ before applying the preconditioner to avoid 
excessive intermediate ranks. 

The approximate Newton scheme converges equally well as the best Richardson approaches with 
regard to the number of iterations and does not require setting up a preconditioner. For the first 
test case, it only needs about half of the time as the best Richardson approach. For the second test 
case, it is significantly slower than Riemannian preconditioned Richardson. Since this operator is of 
lower rank than the Newton potential, the additional complexity of constructing the approximate 
Hessian does not pay off in this case. 


Quadratic convergence. In Figure|^we investigate the convergence of the approximate Newton 
scheme when applied to a pure Laplace operator, A = L, and to the anisotropic diffusion operator 
A = L + V. In order to have an exact solution of known rank r = 4, we construct the right hand 
side by applying A to a random rank 4 tensor. For the dimension and tensor size we have chosen 
d = 3 and n = 200, respectively. By construction, the exact solution lies on the manifold. Hence, if 
the approximate Newton method converges to this solution, we have zero residual and our Gauss- 


Newton approximation of (28) is an exact second-order model despite only containing the A term. 


In other words, we expect quadratic convergence when A = L but only linear when A = L + V since 
our approximate Newton method (29) only solves with L. This is indeed confirmed in Figure]^ 



Figure 4: Convergence of the approximate Newton method for the zero-residual case when applied 
to a pure Laplace operator L and to the anisotropic diffusion operator L -\-V. 


5.4 Results for the TT format 


In the TT format, we compare the convergence of our approximate Newton scheme (with the 


overlapping block-Jacobi preconditioner described in (4.3.2) to a standard approach, the alternating 
linear scheme (ALS). 

We have chosen d = 60, n = 100, and a random rank-one right hand sides of norm one. In the 
first test case, the Newton potential, we have chosen TT ranks r = 10 for the approximate solution. 
The corresponding convergence curves are shown in Figure We observe that the approximate 
Newton scheme needs significantly less time to converge than the ALS scheme. As a reference, we 
have also included a steepest descent method using the overlapping block-Jacobi scheme directly 
as a preconditioner for every gradient step instead of using it to solve the approximate Newton 
equation ( |36[ ). The additional effort of solving the Newton equation approximately clearly pays off. 

In Figure we show results for the anisotropic diffusion case. To obtain a good accuracy of 
the solution, we have to choose a relatively high rank of r = 25 in this case. Here, the approximate 
Newton scheme is still faster, especially at the beginning of the iteration, but the final time needed 
to reach a residual of 10“^ is similar to ALS. 
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Figure 5: Newton potential with d = 60. Convergence of ALS compared to preconditioned steepest 
descent with overlapping block-Jacobi as preconditioner and the approximate Newton scheme. Left: 
Relative residual as function of iterations. Right: Relative residual as function of time. 




Figure 6: Anisotropic diffusion with d = 60. Convergence of ALS compared to preconditioned steep¬ 
est descent with overlapping block-Jacobi as preconditioner and the approximate Newton scheme. 
Left: Relative residual as function of iterations. Right: Relative residual as function of time. 


Note that in Figures and the plots with regard to the number of iterations are to be read 
with care due to the different natures of the algorithms. One ALS iteration corresponds to the 
optimization of one core. In the plots, the beginning of each half-sweep of ALS is denoted by 
a circle. To assessment the performance of both schemes as fairly as possible, we have taken 
considerable care to provide the same level of optimization to the implementations of both the ALS 
and the approximate Newton scheme. 

Mesh-dependence of the preconditioner. To investigate how the performance of the pre¬ 
conditioner depends on the mesh width of the discretization, we look again at the anisotropic 
diffusion operator and measure the convergence as the mesh width h and therefore the tensor size 
n G {60,120,180, 240, 360,420,480, 540,600} changes by one order of magnitude. As in the test for 
quadratic convergence, we construct the right hand side by applying A to a random rank 3 tensor. 
For the dimension and tensor size we have chosen d = 3 and n = 200, respectively. 

To measure the convergence, we take the number of iterations needed to converge to a relative 
residual of 10“®. For each tensor size, we perform 30 runs with random starting guesses of rank 
r = 3. The result is shown in Figure where circles are drawn for each combination of size n and 
number of iterations needed. The radius of each circle denotes how many runs have achieved a 
residual of 10“® for this number of iterations. 

On the left plot of0 we see the results of dimension d = 10, whereas on the right plot we have 
d = 30. We see that the number of iterations needed to converge changes only mildly as the mesh 
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Tensor size n Tensor size n 

Figure 7: Number of iterations that the proposed approximate Newton scheme needs to reach a 
relative residual o/10“® for different mesh widths h — 1/n. The solution has dimension d = 10 and 
rank r = 3. We perform 30 runs for each size. The radii of the circles corresponds to the number 
of runs achieving this number of iterations. Left: Dimension d = 10. Right: Dimension d = 30. 


width varies over one order of magnitude. In addition, the dependence on d is also not very large. 

5.5 Rank-adaptivity 

Note that in many applications, rank-adaptivity of the algorithm is a desired property. For the 
Richardson approach, this would result in replacing the fixed-rank truncation with a tolerance- 
based rounding procedure. In the alternating optimization, this would lead to the DMRG or AMEn 
algorithms. In the framework of Riemannian optimization, rank-adaptivity can be introduced by 
successive runs of increasing rank, using the previous solution as a warm start for the next rank. 
For a recent discussion of this approach, see |60j . A basic example of introducing rank-adaptivity 
to the approximate Newton scheme is shown in Figure Starting from ranks r^^^ = I, we run 
the approximate Newton scheme for 10 iterations and use this result to warm start the algorithm 
with ranks -I- 5. At each rank, we perform 10 iterations of the approximate Newton 

scheme. The result is compared to the convergence of approximate Newton when starting directly 
with the target rank r^’'\ We see that the obtained relative residuals match for each of the ranks 
Although the adaptive rank scheme is slower for a desired target rank due to the additional 
intermediate steps, it offers more flexibility when we want to instead prescribe a desired accuracy. 
For a relative residual of 10“^, the adaptive scheme needs about half the time than using the (too 
large) rank r = 36. 

Note that in the case of tensor completion, rank adaptivity becomes a crucial ingredient to 
avoid overfitting and to steer the algorithm into the right direction, see e.g. [sn EH Eg Eoi Eg. 
For difficult completion problems, careful core-by-core rank increases become necessary. Here, for 
linear systems, such a core-by-core strategy does not seem to be necessary, as the algorithms will 
converge even if we directly optimize using rank r = 36. This is likely due to the preconditioner 
which acts globally over all cores. 
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Figure 8: Rank-adaptivity for approximate Newton applied to the anisotropic diffusion equation with 
n = 100, d = 10. Starting from rank 1, the rank is increased by 5 after 10 iterations per rank. Each 
rank increase is denoted by a black circle. The other curves show the convergence when running 
approximate Newton directly with the target rank. 

6 Conclusions 

We have investigated different ways of introducing preconditioning into Riemannian gradient de¬ 
scent. As a simple but effective approach, we have seen the Riemannian truncated preconditioned 
Richardson scheme. Another approach used second-order information by means of approximating 
the Riemannian Hessian. In the Tucker case, the resulting approximate Newton equation could be 
solved efficiently in closed form, whereas in the TT case, we have shown that this equation can 
be solved iteratively in a very efficient way using PCG with an overlapping block-Jacobi precondi¬ 
tioner. The numerical experiments show favorable performance of the proposed algorithms when 
compared to standard non-Riemannian approaches, such as truncated preconditioned Richardson 
and ALS. The advantages of the approximate Newton scheme become especially pronounced in 
cases when the linear operator is expensive to apply, e.g., the Newton potential. 
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